Reward Estimation for Dialogue Policy Optimisation

Authors

Pei-Hao Su

Milica Gašić

Steve Young

Abstract

Viewing dialogue management as a reinforcement learning task enables a system to learn to act optimally by maximising a reward function. This reward function is designed to induce the system behaviour required for the target application; for goal-oriented applications, this usually means fulfilling the user’s goal as efficiently as possible. However, in real-world spoken dialogue system applications, the reward is hard to measure because the user’s goal is frequently known only to the user. The system can ask the user whether the goal has been satisfied, but this can be intrusive. Furthermore, in practice, the accuracy of the user’s response has been found to be highly variable. This paper presents two approaches to tackling this problem. First, a recurrent neural network is used as a task success predictor, which is pre-trained on off-line data to estimate task success during subsequent on-line dialogue policy learning. Second, an on-line learning framework is described in which a dialogue policy is trained jointly with a reward function modelled as a Gaussian process with active learning. This Gaussian process operates on a fixed-dimension embedding that encodes each variable-length dialogue. This dialogue embedding is generated in both a supervised and an unsupervised fashion using different variants of a recurrent neural network. The experimental results demonstrate the effectiveness of both the off-line and on-line methods. These methods enable practical on-line training of dialogue policies in real-world applications.
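
The on-line idea in the abstract (a Gaussian process reward model that asks the user for feedback only when it is unsure) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the dialogue embedding is already available as a fixed-length vector, uses plain Gaussian process regression on +1/-1 success labels for simplicity (the model in the paper may differ), and the names `ActiveRewardEstimator`, `std_threshold`, and `ask_user` are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


class ActiveRewardEstimator:
    """Toy GP regressor over dialogue embeddings with uncertainty-driven queries."""

    def __init__(self, noise=0.1, std_threshold=0.5):
        self.noise = noise                  # assumed label-noise variance
        self.std_threshold = std_threshold  # hypothetical query threshold
        self.X = None                       # embeddings labelled so far
        self.y = None                       # labels: +1 success, -1 failure

    def predict(self, x):
        """Return the predictive mean and standard deviation for one embedding."""
        x = np.atleast_2d(x)
        if self.X is None:
            return 0.0, 1.0  # GP prior: zero mean, unit variance
        K = rbf_kernel(self.X, self.X) + self.noise * np.eye(len(self.X))
        k_star = rbf_kernel(self.X, x)
        mean = k_star.T @ np.linalg.solve(K, self.y)
        var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
        return float(mean[0]), float(np.sqrt(max(var[0], 0.0)))

    def add_label(self, x, label):
        """Store one user-provided label for an embedding."""
        x = np.atleast_2d(x)
        if self.X is None:
            self.X, self.y = x, np.array([label], dtype=float)
        else:
            self.X = np.vstack([self.X, x])
            self.y = np.append(self.y, label)

    def reward(self, x, ask_user):
        """Return (reward, queried): query the user only when the GP is uncertain."""
        mean, std = self.predict(x)
        if std > self.std_threshold:
            label = float(ask_user())
            self.add_label(x, label)
            return label, True
        return (1.0 if mean >= 0.0 else -1.0), False


if __name__ == "__main__":
    estimator = ActiveRewardEstimator()
    embedding = np.array([0.0, 1.0])
    # First dialogue: nothing is known yet, so the user is asked.
    print(estimator.reward(embedding, ask_user=lambda: 1.0))
    # A near-identical dialogue: the estimate is reused without asking.
    print(estimator.reward(embedding, ask_user=lambda: -1.0))
```

In this sketch the returned reward would feed the policy update after each dialogue, and the threshold trades the number of user queries against the risk of training on a wrong estimate.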

Similar resources

Reward Shaping for Statistical Optimisation of Dialogue Management

This paper investigates the impact of reward shaping on a reinforcement learning-based spoken dialogue system’s learning. A diffuse reward function gives a reward after each transition between two dialogue states. A sparse function only gives a reward at the end of the dialogue. Reward shaping consists of learning a diffuse function without modifying the optimal policy compared to a sparse one....

Learning what to say and how to say it: Joint optimisation of spoken dialogue management and natural language generation

This paper argues that the problems of dialogue management (DM) and Natural Language Generation (NLG) in dialogue systems are closely related and can be fruitfully treated statistically, in a joint optimisation framework such as that provided by Reinforcement Learning (RL). We first review recent results and methods in automatic learning of dialogue management strategies for spoken and multimod...

On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems

The ability to compute an accurate reward function is essential for optimising a dialogue policy via reinforcement learning. In real-world applications, using explicit user feedback as the reward signal is often unreliable and costly to collect. This problem can be mitigated if the user’s intent is known in advance or data is available to pre-train a task success predictor off-line. In practice...

Natural belief-critic: a reinforcement algorithm for parameter estimation in statistical spoken dialogue systems

This paper presents a novel algorithm for learning parameters in statistical dialogue systems which are modelled as Partially Observable Markov Decision Processes (POMDPs). The three main components of a POMDP dialogue manager are a dialogue model representing dialogue state information; a policy which selects the system’s responses based on the inferred state; and a reward function which speci...

Reinforcement learning for parameter estimation in statistical spoken dialogue systems

Reinforcement techniques have been successfully used to maximise the expected cumulative reward of statistical dialogue systems. Typically, reinforcement learning is used to estimate the parameters of a dialogue policy which selects the system’s responses based on the inferred dialogue state. However, the inference of the dialogue state itself depends on a dialogue model which describes the exp...

Publication date: 2017
